Survival Risk Prediction of Esophageal Squamous Cell Carcinoma Based on BES-LSSVM

Esophageal squamous cell carcinoma (ESCC) is among the cancers with the highest incidence and mortality in the world. An effective survival prediction model can improve the quality of patients' survival. In this study, ten indicators related to the survival of patients with ESCC are found using genetic algorithm feature selection. The prognostic index (PI) for ESCC is established using binary logistic regression. PI is divided into four stages, and each stage reasonably reflects the survival status of different patients. By plotting the ROC curve, the critical threshold of patients' age is found, and patients are divided into a high-age group and a low-age group. Using PI and the ten survival-related indicators as independent variables, a survival prediction model for patients with ESCC is established based on the bald eagle search (BES) and the least-squares support vector machine (LSSVM). The results show that five-year survival rates of patients are well predicted by the bald eagle search-least-squares support vector machine (BES-LSSVM). BES-LSSVM has higher prediction accuracy than the existing particle swarm optimization-least-squares support vector machine (PSO-LSSVM), grasshopper optimization algorithm-least-squares support vector machine (GOA-LSSVM), differential evolution-least-squares support vector machine (DE-LSSVM), sparrow search algorithm-least-squares support vector machine (SSA-LSSVM), bald eagle search-back propagation neural network (BES-BPNN), and bald eagle search-extreme learning machine (BES-ELM).


Introduction
Cancer is one of the leading causes of human death in both developed and developing countries [1]. Esophageal cancer is the sixth leading cancer in the world, including esophageal squamous carcinoma and esophageal adenocarcinoma [2]. More than 90% of esophageal cancers are esophageal squamous cell carcinoma, and most of them are diagnosed in advanced stages [3]. The pathology of esophageal squamous cell carcinoma is complicated, and effective diagnosis and treatment strategies are lacking [4,5]. In recent years, the incidence of esophageal squamous cell carcinoma has been on the rise, and the mortality rate remains high [6].
With the continuous deepening of research, the treatment methods and treatment concepts for ESCC have steadily improved [7][8][9]. However, there is still a lack of marker models and prognostic indices that can accurately and effectively reflect the prognosis of ESCC patients [10]. Generally, TNM staging is considered to be the best prognostic indicator for ESCC. However, patients with the same TNM stage often have different prognoses [11]. TNM staging alone cannot accurately determine the patient's risk of death [12]. Therefore, it is important to establish a reasonable prognostic index.
In recent years, with the continuous progress of machine learning technology, more and more intelligent algorithms have been proposed and applied in multiple fields [13][14][15][16][17][18][19]. A hybrid model of genetic algorithm (GA) and least-squares support vector machine (LSSVM) is used by Ahmadi and Chen [20] to predict the relevant experimental permeability reduction ratio due to scale deposition during water injection, and the results confirm the validity of the GA-LSSVM model. LSSVM is used by Ahmadi and Pournik [21] to build a predictive model for determining the chemical flooding efficiency of the oil reservoir, and the results show that the model has good stability and reliability. In [22], a method based on local mean decomposition and an improved FA-optimized combined kernel least-squares support vector machine is proposed to predict short-term wind speed. The results show that the proposed LMD-FA-LSSVM model has better prediction performance.
In the medical field, doctors' diagnoses are effectively aided by the application of many new algorithms. A combined classification and regression approach is proposed by Zhu et al. [23] for early diagnosis of COVID-19 and prediction of time to conversion in patients with severe symptoms. The results show that the accuracy of the proposed method in predicting severe cases reached 76.97% with a correlation coefficient of 0.524. In [24], a method combining extreme learning machine and gain ratio feature selection is proposed and tested on the Wisconsin Breast Cancer Diagnostic (WBCD) dataset. The experimental results show that the accuracy of the proposed method reaches 0.9868. The genetic algorithm is used by Majid et al. [25] to select the best features, and an ensemble classifier is then used to predict gastric infections. The results show that the proposed method performs better than existing methods. In addition, random forest [26], extreme learning machines [27], BP neural networks [28,29], and Elman neural networks [30] have achieved satisfactory results in the prognosis and diagnosis of certain cancers.
Compared with the above studies [24,25,27,28], which mostly use genetic information and image information to predict patient mortality, the proposed work has the following advantages. First, the patients' blood indicators and TNM staging indicators are used to predict the patient's survival status. Second, an effective prognostic index is established, which significantly improves the performance of the prediction model. Third, existing machine learning approaches rarely distinguish between patients of different ages; because of differences in patient age, it is difficult for a single model to accurately predict the survival risk of all patients. Therefore, the goal of this article is to find a new set of indicators related to the survival of ESCC patients based on the patient's blood indicators and TNM staging information, establish a reasonable prognostic index, and combine new machine learning techniques to predict the survival rate of patients of different ages.
In this study, seventeen blood indicators, age, and TNM staging information of 360 patients with ESCC are studied. Ten indicators related to patient survival are found through genetic algorithm feature selection. The combination of these ten indicators has a significant correlation with patient survival, which is verified by the Cox regression method in the SPSS software. Using the binary logistic regression method, the prognostic index (PI) of patients with ESCC is constructed. The PI is divided into four stages, and each stage reasonably reflects the different survival conditions of patients. A comparison of the PI staging system with the traditional TNM staging system shows that the PI staging system has a better AUC value. The ROC curve method is used to determine the critical threshold of patient age, and the patients are divided into a high-age group and a low-age group. Then, based on the Kaplan-Meier survival analysis, it is concluded that the low-age group has a better survival rate than the high-age group, which effectively reflects the survival status of different patients. Finally, the bald eagle search algorithm-least-squares support vector machine (BES-LSSVM) survival prediction model is proposed. The bald eagle search algorithm is used to optimize the parameters of the least-squares support vector machine, which improves the prediction accuracy of the model. The PI and the above ten related indicators are used as inputs, and the five-year survival rate of the patient is used as output. The prediction accuracy of BES-LSSVM is better than that of the existing PSO-LSSVM, GOA-LSSVM, DE-LSSVM, SSA-LSSVM, BES-BPNN, and BES-ELM. Therefore, the method proposed in this study can accurately predict the survival level of patients with ESCC.
The purpose of this article is to propose the prognostic index PI and a survival prediction model based on blood indicators and TNM staging information of patients with ESCC. Based on genetic algorithm feature selection, binary logistic regression, ROC curve, Kaplan-Meier survival analysis, Cox regression analysis, and BES-LSSVM, a method for predicting the survival risk of patients with ESCC is proposed. The main contributions of this article can be summarized as follows: (1) A combination of ten indicators is found based on genetic algorithm feature selection, which is verified to be significantly associated with survival in patients with ESCC. (2) The prognostic index of patients with ESCC is constructed by the binary logistic regression method, which can reasonably reflect the survival of patients at different stages.
This work is presented as follows. In Section 2, the original data are analyzed, a combination of multiple indicators that is significantly related to patient survival is found, and the prognostic index is constructed. The survival risk of patients of different ages is obtained. In Section 3, the bald eagle search-least-squares support vector machine is proposed, and the five-year survival rate of patients with ESCC is effectively predicted. In Section 4, the conclusions of this article are presented.
[...] Table 1. Information on seventeen blood indicators is shown in Table 2.

Feature Selection Based on Genetic Algorithm.
A genetic algorithm (GA) is a global optimization adaptive probability search algorithm [31]. GA performs a population-based search, which makes it easy to jump out of local optima [32]. Therefore, it is often chosen as the search algorithm for feature selection. In many studies, GA is used as a wrapper feature selection technique [33]. In this study, 17 blood indicators and TNM staging information of patients with ESCC are used as independent variables, and the five-year survival rate of patients is used as the dependent variable. The least-squares support vector machine is used as the classifier in genetic algorithm feature selection to evaluate the subset of features related to the survival rate of patients. The main process of multi-index feature extraction based on genetic algorithm feature selection (GA-FS) is as follows.
Step 1: generation of the initial population. A population is randomly generated as the first-generation solution of the problem. 17 blood indicators and TNM staging information of 360 esophageal cancer patients are selected as inputs and normalized to [−1, 1] by the mapminmax function. The mapminmax function is calculated by the following equation:
$$y = \frac{(y_{\max} - y_{\min})(x - x_{\min})}{x_{\max} - x_{\min}} + y_{\min},$$
where x_max and x_min are the maximum and minimum of the input x, y_max is 1, and y_min is −1.
Step 2: coding individuals in the population. The chromosome of each individual in the population is coded using a binary coding method, and each binary bit corresponds to one feature in the feature set. The initial features include seventeen blood indicators, T staging, N staging, and TNM staging. In each bit of the binary code, "0" indicates that the feature is not selected, and "1" indicates that the feature is selected. The dataset is divided into a training set and a test set.
Step 3: determine the fitness function. The value of the fitness function indicates the quality of the individual or solution. The purpose of using the genetic algorithm (GA) for feature selection is to improve the classification accuracy of the least-squares support vector machine (LSSVM) while reducing the number of selected features as much as possible.
Therefore, the fitness function is constructed as Fitness = α · R + β · M/N. R is the classification accuracy of the LSSVM classifier. M is the number of selected features. N is the number of all features. α is a scaling parameter, which reflects the proportion of classification accuracy in the fitness function. β is an importance parameter, which reflects the weight of the number of selected features in the fitness function, and α + β = 1.
Step 4: sort and select. The fitness values are calculated, and individuals in the population are selected using a roulette wheel algorithm as the selection operator. The greater the fitness (i.e., the higher the classification accuracy and the lower the number of features), the greater the probability that the individual will be selected for the next generation.
Step 5: crossover. In this study, the crossover operation uses a two-point crossover operator, and the principle of the crossover operator is shown in Figure 1. Two crossover points are randomly set in the individual code string, and the genes between them are exchanged. The crossover probability is generally 0.4 to 0.99, and the crossover probability selected in this study is 0.7.
Step 6: mutation. When the set mutation probability is met, the individuals in the population are sequentially subjected to random bit mutation. In the genetic algorithm (GA), the value of the mutation probability is generally 0.001 to 0.1, and the mutation probability used in this study is 0.05.
Step 7: the fitness value is calculated. The selected features are input into the LSSVM, and the fitness value is obtained by the ten-fold cross-validation method. If the current solution is better than the optimal solution, the optimal solution is updated.
When the maximum number of iterations is reached, the loop ends. To clearly express the GA-FS process, the framework of GA-FS is shown in Algorithm 1.
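The operators in Steps 1 to 6 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names and the value `alpha=0.9` are assumptions, and the LSSVM accuracy R is passed in as a number instead of being computed. The fitness is coded exactly as printed (α · R + β · M/N); because Step 4 describes fitter individuals as having fewer features, a variant such as β · (1 − M/N) may be closer to the intent, which is also an assumption.

```python
import numpy as np

def mapminmax(x, y_min=-1.0, y_max=1.0):
    """Column-wise min-max scaling of x to [y_min, y_max] (Step 1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (y_max - y_min) * (x - x_min) / span + y_min

def fitness(accuracy, mask, alpha=0.9):
    """Fitness = alpha * R + beta * M / N with beta = 1 - alpha (Step 3, as printed)."""
    beta = 1.0 - alpha
    return alpha * accuracy + beta * mask.sum() / mask.size

def roulette_select(population, fitness_values, rng):
    """Sample the next generation with probability proportional to fitness (Step 4)."""
    probs = fitness_values / fitness_values.sum()
    idx = rng.choice(len(population), size=len(population), p=probs)
    return population[idx]

def two_point_crossover(a, b, rng, p_cross=0.7):
    """Swap the genes between two random cut points (Step 5)."""
    a, b = a.copy(), b.copy()
    if rng.random() < p_cross:
        i, j = sorted(rng.choice(np.arange(1, a.size), size=2, replace=False))
        a[i:j], b[i:j] = b[i:j].copy(), a[i:j].copy()
    return a, b

def mutate(chrom, rng, p_mut=0.05):
    """Flip each bit independently with probability p_mut (Step 6)."""
    flip = rng.random(chrom.size) < p_mut
    return np.where(flip, 1 - chrom, chrom)
```

In a full GA-FS loop, each binary chromosome would select columns of the normalized data, the ten-fold cross-validated LSSVM accuracy would supply `accuracy`, and the three operators would be applied once per generation until the iteration limit is reached.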
Through the feature selection results of the genetic algorithm, the index combination most relevant to patient survival is obtained: T staging, N staging, TNM staging, WBC, EO, RBC, PLT, TP, PT, and INR. With this combination, the ten-fold cross-validation classification accuracy of LSSVM reaches its highest value, 83.077%.

The Correlation of Indicators Is Verified by Cox Regression Analysis.
The Cox regression model is a semiparametric regression model that can analyze the impact of multiple factors on survival [34]. Therefore, it is widely used in the medical field. The "SPSS 22.0" statistical software is used to fit the Cox model. The survival time and survival outcome of patients with ESCC are used as dependent variables. The above ten indicators are the independent variables. The survival function at the mean of the covariates is shown in Figure 2. The results show that the p value of the overall score of the ten indicators is 0.000131, far less than 0.05. The combination of these ten indicators is therefore significantly related to the survival rate of patients.

Evaluation and Establishment of Prognostic Indicators.
This section establishes and evaluates the prognostic index (PI) of patients with ESCC to better classify patients and provide good clinical guidance. In the above section, the ten indicators that are significantly related to the survival of patients are selected through genetic algorithm feature selection: T stage, N stage, TNM stage, WBC, EO, RBC, PLT, TP, PT, and INR. Binary logistic regression analysis [35] is used to construct the prognostic index. The patient's survival status is used as the dependent variable, and the ten indicators are used as independent variables. The prognostic index of ESCC is constructed by the fitted regression equation.
The receiver operating characteristic (ROC) [36] curve is usually used to select the best diagnostic threshold and divide the indicators into two categories. The ROC curve of PI is shown in Figure 3(a). The AUC value is 0.660, P < 0.001, indicating that PI has predictive value for the prognosis of ESCC patients. The comparison of ROC curves between the PI and TNM staging systems is shown in Figure 3(b). The comparison results of PI and TNM are shown in Table 3. By analyzing and comparing the ROC curves of PI and TNM, it can be concluded that the predictive effect of the prognostic index PI in this study is better than that of the TNM staging system.
To better predict the survival status of ESCC patients, the ROC curve is further analyzed to determine the best cutoff value of PI. The PI values of all samples are used as inputs, and the ROC curve is drawn, as shown in Figure 3. The area under the curve is 0.660, which is greater than 0.5, with P < 0.001. A threshold for PI therefore exists. By calculating the Youden index, PI can be divided into two levels.
The Youden index is calculated by the following equation:
$$\text{Youden index} = \text{sensitivity} + \text{specificity} - 1.$$
The Youden index is calculated as 0.303. Then, for samples with PI values higher than 0.303 and samples with PI values lower than 0.303, ROC curves are drawn, as shown in Figure 4. The Youden index, AUC value, significance, and other related indicators are shown in Table 4. It can be seen from Table 4 that the AUC values of the three ROC curves are all greater than 0.5, and the significance P value is less than 0.05.
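The cutoff search described here can be illustrated with a short NumPy routine. It is a sketch under stated assumptions, not the SPSS procedure used in the study: the function name `youden_cutoff` is hypothetical, candidate thresholds are simply the observed score values, and a sample is called positive when its score is at or above the threshold.

```python
import numpy as np

def youden_cutoff(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    labels must contain both classes (1 = positive, 0 = negative);
    otherwise sensitivity or specificity is undefined.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        fp = np.sum(pred & (labels == 0))
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

Applying the same routine again to the samples above and below the first cutoff would yield further thresholds, which is the kind of repeated split that produces a multi-stage index.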
According to the ROC curves, three critical thresholds of PI are obtained in sequence: 0.303, 0.016, and 0.873. According to these critical thresholds, PI is divided into four stages, namely PI-I, PI-II, PI-III, and PI-IV. The four stages of PI are analyzed by Kaplan-Meier analysis, and the results are shown in Figure 5. According to the Kaplan-Meier analysis [37], PI-I has the best prognostic effect, which is better than PI-II, PI-III, and PI-IV for patients with ESCC.
Age has an important influence on the physiological immunity of the patient, and it is related to the patient's tolerance to different treatment methods. Therefore, differences in age will also lead to different prognoses of ESCC patients. It is important to construct different survival prediction models for patients of different ages. The ROC curve is used to determine the best cutoff value of the patient's age. It is plotted with the age of all samples as the variable, named "ROC of the patient's age," as shown in Figure 6. The area under the curve (AUC) is 0.618, which is greater than 0.5, with P < 0.001. A critical threshold can therefore be found for age, which divides age into two risk levels.

Divide Risk Levels
After calculating the Youden index, the critical threshold of age is 61.5 years. Using this threshold, patients are divided into the high-age and low-age groups. The Kaplan-Meier survival analysis is performed on the high-age and low-age groups, and the results are shown in Figure 7.
There is a significant difference in survival rate between the high-age group and the low-age group (P < 0.05), and the low-age group has a better survival rate than the high-age group.
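For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate behind such survival curves can be sketched as follows. This is an illustrative implementation with a hypothetical function name; it computes only the curve, not the significance test between groups, and assumes right-censored data with an event flag of 1 for an observed death and 0 for censoring.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit survival estimate for right-censored data.

    Returns the distinct event times and the estimated survival
    probability just after each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                    # still under observation at t
        deaths = np.sum((times == t) & (events == 1))   # events exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

To compare the two age groups, the estimator would be run separately on the samples with age above 61.5 years and on those below it, giving one curve per group.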

Bald Eagle Search Algorithm-Least-Squares Support Vector Machine.
The bald eagle search algorithm (BES) was proposed by Alsattar et al. [38]. It is a meta-heuristic optimization algorithm based on the hunting strategy and social behavior of the bald eagle. The algorithm has strong global search capabilities and can effectively solve various complex numerical optimization problems. In this study, the bald eagle search algorithm is used to optimize the parameters of the least-squares support vector machine, which improves its prediction accuracy. The survival rate of ESCC patients is predicted based on the proposed BES-LSSVM classification prediction model. The bald eagle search algorithm is mainly divided into three stages, namely the select stage, the search stage, and the swooping stage.

Select Stage.
In the select stage, the bald eagles select the best area (according to the amount of food) within the selected search area and start looking for prey. At this time, the position P of the bald eagle is determined by multiplying the a priori information of the random search by α. The mathematical model of this behavior is constructed as follows:
$$P_{\text{new},i} = P_{\text{best}} + \alpha \cdot r \cdot (P_{\text{mean}} - P_i),$$
where α is the parameter that controls the position change, within the range (1.5, 2); r is a random number in (0, 1); P_best is the best position of the bald eagle based on the previous search; P_mean is the average position of the bald eagles after the previous search; and P_i is the position of the ith bald eagle.
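A compact NumPy sketch of this stage is given below, assuming the standard BES update in which each eagle moves to P_best + α · r · (P_mean − P_i). The function name, the default `alpha=1.8`, and the greedy rule that keeps a new position only if it scores higher are assumptions made for illustration; the text does not specify an acceptance rule.

```python
import numpy as np

def select_stage(P, fitness_fn, alpha=1.8, rng=None):
    """One select-stage update for a population P of shape (n, dim).

    fitness_fn maps a position vector to a score to be maximized.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([fitness_fn(p) for p in P])
    p_best = P[np.argmax(scores)]           # best position found so far
    p_mean = P.mean(axis=0)                 # average position of the population
    r = rng.random((P.shape[0], 1))         # one random number per eagle
    candidates = p_best + alpha * r * (p_mean - P)
    cand_scores = np.array([fitness_fn(p) for p in candidates])
    improved = cand_scores > scores         # greedy acceptance (assumption)
    return np.where(improved[:, None], candidates, P)
```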

Search Stage.
In the search stage, the bald eagles fly in different directions in a spiral shape, speeding up the search for prey. Then, the bald eagle looks for the best position in the selected space to swoop and hunt. The position update of the bald eagle during spiral flight adopts the form of a polar coordinate equation, as follows:
$$\theta(i) = a \cdot \pi \cdot \text{rand}, \qquad r(i) = \theta(i) + R \cdot \text{rand},$$
$$xr(i) = r(i) \cdot \sin(\theta(i)), \qquad yr(i) = r(i) \cdot \cos(\theta(i)),$$
$$x(i) = \frac{xr(i)}{\max(|xr|)}, \qquad y(i) = \frac{yr(i)}{\max(|yr|)},$$
where a and R are parameters in the ranges (5, 10) and (0.5, 2), respectively, which are used to control the spiral trajectory. θ(i) and r(i) are the polar angle and polar diameter of the spiral equation, respectively. x(i) and y(i) represent the position of the bald eagle in polar coordinates, and both take values in (−1, 1). xr(i) and yr(i) represent the position of the bald eagle in the Cartesian coordinate system. rand is a random number in (0, 1). The location of the bald eagle is constructed as follows:
$$P_{\text{new},i} = P_i + y(i) \cdot (P_i - P_{i+1}) + x(i) \cdot (P_i - P_{\text{mean}}).$$
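The spiral move can be sketched in NumPy as follows, assuming the standard BES formulation: θ = a · π · rand, r = θ + R · rand, Cartesian offsets r · sin θ and r · cos θ normalized by their largest absolute value, and the update P_i + y(i)(P_i − P_{i+1}) + x(i)(P_i − P_mean). The function names, the defaults `a=8.0` and `R=1.5`, and the wrap-around choice for the neighbor of the last eagle are illustrative assumptions.

```python
import numpy as np

def spiral_offsets(n, a=8.0, R=1.5, rng=None):
    """Normalized spiral coordinates x(i), y(i), each within [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    theta = a * np.pi * rng.random(n)        # polar angle
    r = theta + R * rng.random(n)            # polar diameter
    xr, yr = r * np.sin(theta), r * np.cos(theta)
    return xr / np.max(np.abs(xr)), yr / np.max(np.abs(yr))

def search_stage(P, a=8.0, R=1.5, rng=None):
    """One search-stage update for a population P of shape (n, dim)."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = spiral_offsets(P.shape[0], a, R, rng)
    P_next = np.roll(P, -1, axis=0)          # P_{i+1}; the last eagle wraps to the first
    p_mean = P.mean(axis=0)
    return P + y[:, None] * (P - P_next) + x[:, None] * (P - p_mean)
```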

Swooping Stage.
In the swooping stage, the bald eagles quickly swoop from the best position in the search space to their target prey. At the same time, other individuals in the population move to the best position and attack the prey. The state of motion of the bald eagle is described by the polar coordinate equation:
$$\theta(i) = a \cdot \pi \cdot \text{rand}, \qquad r(i) = \theta(i),$$
$$xr(i) = r(i) \cdot \sinh[\theta(i)], \qquad yr(i) = r(i) \cdot \cosh[\theta(i)],$$
$$x_1(i) = \frac{xr(i)}{\max(|xr|)}, \qquad y_1(i) = \frac{yr(i)}{\max(|yr|)}.$$
The formula for updating the position of the bald eagle during swooping is constructed as follows:
$$P_{\text{new},i} = \text{rand} \cdot P_{\text{best}} + x_1(i) \cdot (P_i - c_1 \cdot P_{\text{mean}}) + y_1(i) \cdot (P_i - c_2 \cdot P_{\text{best}}),$$
where c_1 and c_2 increase the movement intensity of the bald eagle toward the optimal point and the center point, and their value range is (1, 2).
For LSSVM, the choice of kernel function is a key factor. The RBF kernel function is selected in this study; its parameter coefficient g affects the performance of LSSVM.
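The swooping-stage update can be sketched in the same style, assuming the standard BES formulation with hyperbolic offsets r · sinh θ and r · cosh θ (with r = θ) normalized by their largest absolute value. The function name and the defaults `a=8.0`, `c1=1.5`, and `c2=1.5` are illustrative assumptions chosen inside the ranges quoted in the text.

```python
import numpy as np

def swoop_stage(P, p_best, a=8.0, c1=1.5, c2=1.5, rng=None):
    """One swooping-stage update for a population P of shape (n, dim)."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    theta = a * np.pi * rng.random(n)
    r = theta
    xr, yr = r * np.sinh(theta), r * np.cosh(theta)
    x1 = xr / np.max(np.abs(xr))             # normalized to [-1, 1]
    y1 = yr / np.max(np.abs(yr))
    p_mean = P.mean(axis=0)
    return (rng.random((n, 1)) * p_best
            + x1[:, None] * (P - c1 * p_mean)
            + y1[:, None] * (P - c2 * p_best))
```

In an optimizer loop, the three stages would be applied in order each iteration, with positions clipped to the allowed parameter range before the fitness is re-evaluated.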
In this study, to improve the classification accuracy of LSSVM, BES is selected to optimize the penalty factor c and the kernel function parameter g of LSSVM. The classification error rate of LSSVM defines the objective of BES optimization: fitness = 1 − classification error rate. The larger the fitness value, the better the classification performance of LSSVM.
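To make the optimization target concrete, the sketch below trains an LSSVM classifier by solving its linear system and returns 1 − error rate for a candidate pair (c, g). It is a minimal illustration, not the toolbox the authors used: the function names are hypothetical, labels are assumed to be coded as −1 and +1, and the kernel is parameterized as exp(−‖x − z‖² / (2g²)), which is only one common convention and may differ from the one in the study.

```python
import numpy as np

def rbf_kernel(A, B, g):
    """RBF kernel matrix; the 1 / (2 g^2) scaling is an assumed convention."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * g**2))

def lssvm_fit(X, y, c, g):
    """Solve [[0, y^T], [y, Omega + I/c]] [b; alpha] = [0; 1] for labels y in {-1, +1}."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, g)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / c
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                   # bias b, multipliers alpha

def lssvm_predict(X_train, y_train, b, alpha, g, X_new):
    """Sign of the kernel expansion sum_i alpha_i y_i K(x, x_i) + b."""
    return np.sign(rbf_kernel(X_new, X_train, g) @ (alpha * y_train) + b)

def bes_objective(params, X_tr, y_tr, X_val, y_val):
    """Fitness = 1 - classification error rate for a candidate (c, g)."""
    c, g = params
    b, alpha = lssvm_fit(X_tr, y_tr, c, g)
    pred = lssvm_predict(X_tr, y_tr, b, alpha, g, X_val)
    return 1.0 - np.mean(pred != y_val)
```

Each bald eagle position would then be a two-dimensional vector (c, g), and `bes_objective` would serve as the fitness that the select, search, and swooping stages try to maximize.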
To clearly express the BES-LSSVM process, the framework of BES-LSSVM is shown in Algorithm 2.

Survival Prediction of Esophageal Squamous Cell Carcinoma.
Ten indicators related to the survival rate of ESCC patients are obtained through genetic algorithm feature selection. These indicators are T stage, N stage, TNM stage, WBC, EO, RBC, PLT, TP, PT, and INR. The prognostic index PI of ESCC patients is obtained by binary logistic regression. These eleven indicators are used as inputs to the BES-LSSVM model, and the five-year survival rate of the patients is used as the output. Survival prediction models for ESCC patients in the high-age group and the low-age group are established separately. The framework of the overall implementation of the survival prediction model for patients with ESCC is shown in Figure 8. To verify the validity of this model, the grasshopper optimization algorithm-least-squares support vector machine (GOA-LSSVM) [39], particle swarm optimization-least-squares support vector machine (PSO-LSSVM) [40], differential evolution-least-squares support vector machine (DE-LSSVM) [41], sparrow search algorithm-least-squares support vector machine (SSA-LSSVM) [42], bald eagle search-back propagation neural network (BES-BPNN), and bald eagle search-extreme learning machine (BES-ELM) are used for comparison.
For the bald eagle search algorithm, the population size is set to 20, and the number of iterations is set to 100. For the particle swarm algorithm, both c_1 and c_2 are set to 1.5, the population size is set to 20, and the number of iterations is set to 100. For the grasshopper optimization algorithm, the population size is set to 20, and the maximum number of iterations is set to 100. For the differential evolution algorithm, the scaling factor F is set to 0.5, the crossover probability CR is set to 0.9, and the maximum number of iterations is set to 100. For the sparrow search algorithm, the population size is set to 20, the safety value is set to 0.6, and the maximum number of iterations is set to 100. The dataset is divided into ten parts, and the ten-fold cross-validation method is used to verify the performance of the model. Nine parts are used as the training set, and one part is used as the validation set. The cross-validation is repeated 10 times, and the average of the ten results is obtained.
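The ten-fold procedure described above can be sketched as follows. The helper names and the `fit_predict` callback are hypothetical; any classifier that trains on nine parts and predicts the held-out part can be plugged in, and the random seed is an arbitrary choice.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, fit_predict, k=10, seed=0):
    """Average validation accuracy over k folds.

    fit_predict(X_train, y_train, X_val) must return predicted labels for X_val.
    """
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        scores.append(np.mean(pred == y[val_idx]))
    return float(np.mean(scores))
```

With 360 samples, each fold holds 36 patients, so every model is trained on 324 samples and validated on the remaining 36.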
This method enables repeated training and testing on random samples, with each part used for validation once. The effect of boundary patient data on the performance of the least-squares support vector machine is thereby effectively reduced. The evaluation metrics include classification accuracy, sensitivity, specificity, and running time. [...] Table 5. The optimal LSSVM model parameters under different optimization methods are shown in Table 6.
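The three classification metrics named above follow directly from the confusion matrix. The short sketch below uses a hypothetical function name and assumes that the positive class is coded as 1 and that both classes occur in `y_true`.

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```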
It can be seen from Table 5 [...] predict the five-year survival rate of ESCC patients. In terms of sensitivity and specificity, the proposed BES-LSSVM also outperforms the other models. In addition, Table 5 shows that BES-LSSVM has the shortest running time.
To better demonstrate the effectiveness of the proposed model, the Wisconsin Diagnostic Breast Cancer (WBCD) dataset is used for testing, and the results are shown in Table 7. The test results show that BES-LSSVM has higher prediction accuracy and a shorter running time than the other models. Therefore, the survival status of cancer patients can be effectively predicted by the survival prediction model proposed in this study.

Conclusions
To accurately and effectively predict the five-year survival rate of patients with ESCC, a survival prediction model based on genetic algorithm feature selection, binary logistic regression, and least-squares support vector machine is proposed in this study. A genetic algorithm and Cox regression are used to determine ten indicators that are significantly related to the survival of patients with ESCC. Based on binary logistic regression, a prognostic indicator PI with predictive value is constructed. Patients are divided into a high-age group and a low-age group by ROC curve analysis. Through the Kaplan-Meier survival analysis, it is concluded that the low-age group has a better survival rate than the high-age group. The bald eagle search algorithm-least-squares support vector machine (BES-LSSVM) is further proposed, which effectively predicts the five-year survival rate of patients with ESCC. The accuracy of BES-LSSVM in predicting the five-year survival of patients with ESCC is better than that of the existing GOA-LSSVM, PSO-LSSVM, DE-LSSVM, SSA-LSSVM, BES-BPNN, and BES-ELM.
This reflects the good practical value of the ESCC survival prediction model proposed in this study in the field of cancer classification prediction.
However, the accuracy of the model may be affected by an increase in the number of samples and classes. Moreover, some important features may occasionally be discarded during the feature selection process. In the future, the combination of swarm intelligence optimization algorithms and the latest deep learning models (such as deep neural networks and convolutional neural networks) will be used to develop a new survival prediction model for patients with ESCC on a larger and more complex dataset.

Data Availability
The data used to support the findings of the study can be obtained from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no competing financial interests or personal relationships that could have appeared to influence the work reported in this study.